Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance
Authors
Abstract
Recent studies estimate that server cost accounts for as much as 57% of the total cost of ownership (TCO) of a datacenter [1]. One key contributor to this high server cost is the procurement of memory devices such as DRAMs, especially for data-intensive datacenter cloud applications that need low latency (such as web search, in-memory caching, and graph traversal). Such memory devices, however, may be prone to hardware errors that occur due to unintended bit flips during device operation [40, 33, 41, 20]. To protect against such errors, traditional systems uniformly employ devices with high-quality chips and error correction techniques, both of which increase device cost. At the same time, we observe that 1) data-intensive applications exhibit a diverse spectrum of tolerance to memory errors, and 2) traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. Our DSN-44 paper [30] is the first to 1) understand how tolerant different data-intensive applications are to memory errors and 2) design a new memory system organization that matches hardware reliability to application tolerance in order to reduce system cost. The main idea of our approach is to classify applications based on their memory error tolerance, and to map applications to heterogeneous-reliability memory system designs, managed cooperatively between hardware and software, to reduce system cost. Our DSN-44 paper provides the following contributions:
Similar resources
Exploiting Memory Device Wear-Out Dynamics to Improve NAND Flash Memory System Performance
This paper advocates a device-aware design strategy to improve various NAND flash memory system performance metrics. It is well known that NAND flash memory program/erase (PE) cycling gradually degrades memory device raw storage reliability, and sufficiently strong error correction codes (ECC) must be used to ensure the PE cycling endurance. Hence, memory manufacturers must fabricate enough num...
A Fault-Tolerant 176 Gbit Solid State Mass Memory Architecture
This paper presents a new Solid State Mass Memory (SSMM) suitable for space applications. The memory reliability is increased by using two different approaches. Firstly, memory mass fault-tolerance, with respect to hard failures, is obtained by using a fine-granularity hierarchical structure with a certain level of redundancy. A second strategy used for facing soft errors is based on Error Corr...
System Effects of Single Event Upsets
At the system level, SEUs in processors are controlled by fault-tolerance techniques such as replication and voting, watchdog processors, and tagged data schemes [13,16,30]. SEUs in memory subsystems are controlled by use of error control codes (ECCs) [4,17,21] and a process called scrubbing. The scrubbing process periodically reads each word in the memory. If the number of faulty digits in a w...
Replication for Efficiency and Fault Tolerance in a DSM System
Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures to execute long running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...
Partially-Forgetful Memories: Relaxing Memory Guard-bands for Approximate Computing
While the memory subsystem is already a major contributor to energy consumption of computing platforms, the guardbanding required for masking the effects of ever-increasing manufacturing variations in memories imposes even more energy overhead. In this paper, we explore how Partially-Forgetful Memories can be used by exploiting the intrinsic tolerance of a vast class of applications to some leve...
Journal title:
- CoRR
Volume abs/1602.00729 Issue
Pages -
Publication date 2015